August 2019, UC Berkeley
Dana Seidel (built off of material by Kellie Ottoboni and Chris Krogslund)
ggplot2, and lattice
And here’s some motivation: we can produce a plot like this with a few lines of code.
(Compare to the famous gapminder plot.)
The general call for base plot looks something like this:
Additional parameters can be passed in to customize the plot:
More layers can be added to the plot with additional calls to lines, points, text, etc.
gapChina <- gap %>% filter(country == "China")
plot(gapChina$year, gapChina$gdpPercap)
plot(gapChina$year, gapChina$gdpPercap, type = "l",
main = "China GDP over time",
xlab = "Year", ylab = "GDP per capita") # with updated parameters
points(gapChina$year, gapChina$gdpPercap, pch = 16)
points(x = 1977, y = gapChina$gdpPercap[gapChina$year == 1977],
col = "red", pch = 16)
There are a variety of other types of plots you can make in base graphics.
boxplot(lifeExp ~ year, data = gap)
hist(gap$lifeExp[gap$year == 2007])
plot(density(gap$lifeExp[gap$year == 2007]))
barplot(gapChina$pop, width = 4, names.arg = gapChina$year,
main = "China population")
lattice and ggplot2 generally don’t exhibit this sort of behavior.
gap_lm <- lm(lifeExp ~ log(gdpPercap) + year, data = gap)
# Calls plotting method for class of the dataset ("data.frame")
plot(gap[,c('pop','lifeExp','gdpPercap')])
# Calls plotting method for class of gap_lm object ("lm"), print first two plots only
plot(gap_lm, which = 1:2)
ggplot2, and lattice
Base graphics is
good for exploratory data analysis and sanity checks
inconsistent in syntax across functions: some take x,y while others take formulas
default plotting parameters are ugly and can be difficult to customize
that said, one can do essentially anything in base graphics with some work
ggplot2 is
generally more elegant
more syntactically logical (and therefore simpler, once you learn it)
better at grouping
able to interface with maps
lattice is
faster than ggplot2 (though only noticeable over many and large plots)
simpler than ggplot2 (at first)
better at trellis graphs than ggplot2
able to do 3d graphs
We’ll focus on ggplot2 as it is very powerful, very widely-used and allows one to produce very nice-looking graphics without a lot of coding.
ggplot2
The general call for ggplot2 graphics looks something like this:
Note that ggplot2 builds a graph in layers within one continuing call (hence the endless +…+…+…), so each extra layer is part of the same call.
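The general shape of such a call can be sketched as follows. This is a minimal illustration rather than the original template: the data frame `afg` is a hypothetical stand-in built from the Afghanistan rows of the gapminder data, and the choice of geoms and labels is arbitrary.

```r
library(ggplot2)

# Small illustrative data frame (Afghanistan life expectancy, gapminder values)
afg <- data.frame(
  year    = c(1952, 1957, 1962, 1967, 1972, 1977),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438)
)

# General shape: ggplot(data, aes(...)) + geom_*() + further layers and options
p <- ggplot(data = afg, mapping = aes(x = year, y = lifeExp)) +  # data and coordinate mapping
  geom_line() +                              # first layer: a line
  geom_point(color = "red") +                # second layer: points drawn on top
  labs(x = "Year", y = "Life expectancy")    # labels are also added with +

p  # printing the object draws the plot
```

Each `+` adds one more component to the same plot object, which is why a plot can be stored in a variable and extended later.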
You can see the layering effect by comparing the same graph with different colors for each layer
p <- ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
geom_point(color = "red")
p
p + geom_point(aes(x = year, y = lifeExp), color = "gray") + ylab("life expectancy") +
theme_minimal()
And, if you’re desperate for the quick and dirty functionality of base plot, or just prefer the more familiar syntax at first, ggplot2 offers the qplot() function as a wrapper for most basic plots:
qplot(x = year, y = lifeExp, data = gapChina)
qplot(x = year, y = lifeExp, data = gapChina, geom = "line")
ggplot2 syntax is very different from base graphics and lattice. It’s built on the grammar of graphics. The basic idea is that the visualization of all data requires four items:
One or more statistics conveying information about the data (identities, means, medians, etc.)
A coordinate system that differentiates between the intersections of statistics (at most two for ggplot, three for lattice)
Geometries that differentiate between off-coordinate variation in kind
Scales that differentiate between off-coordinate variation in degree
ggplot2 allows the user to manipulate all four of these items through the stat_*, coord_*, geom_*, and scale_* functions.
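As a rough sketch of how the four families fit together, the plot below uses one function from each. The data frame `le` is a small illustrative subset (four countries, 1952 to 1972) of gapminder life expectancy values, and the particular statistic, limits, and coordinate flip are assumptions chosen only for demonstration.

```r
library(ggplot2)

# Illustrative subset of gapminder life expectancy values
le <- data.frame(
  country = rep(c("Afghanistan", "Albania", "Algeria", "Angola"), each = 5),
  year    = rep(c(1952, 1957, 1962, 1967, 1972), times = 4),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088,
              55.23, 59.28, 64.82, 66.22, 67.69,
              43.077, 45.685, 48.303, 51.407, 54.518,
              30.015, 31.999, 34.000, 35.985, 37.928)
)

p <- ggplot(le, aes(x = country, y = lifeExp)) +
  geom_point(color = "gray40") +               # geom_*: raw values as points
  stat_summary(fun = mean, geom = "point",
               color = "red", size = 3) +      # stat_*: per-country mean
  scale_y_continuous(limits = c(0, 80)) +      # scale_*: range of the y scale
  coord_flip()                                 # coord_*: swap the x and y axes

p
```

Note that the `fun` argument of `stat_summary()` assumes a reasonably recent ggplot2 release; older versions used `fun.y` instead.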
All of these are important to truly becoming a ggplot2 master 🧙‍♂️ 😉, but today we are going to focus on the item most important to basic users and their data layers: ggplot2’s geometries.
# Scatterplot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_point() +
ggtitle("China's life expectancy")
# Line (time series) plot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_line() +
ggtitle("China's life expectancy")
# Boxplot
ggplot(gap, aes(x = factor(year), y = lifeExp)) + geom_boxplot() +
ggtitle("World's life expectancy")
# Histogram
gap2007 <- gap %>% filter(year == 2007)
ggplot(gap2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
ggtitle("World's life expectancy")
ggplot2 and tidy data
ggplot2 plays nice with dplyr and pipes. If you want to manipulate your data specifically for one plot but not save the new dataset, you can call your dplyr chain and pipe it directly into a ggplot call.
# This combines the subsetting and plotting into one step
gap %>% filter(year == 2007) %>%
ggplot(aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
ggtitle("World's life expectancy")
Base graphics and ggplot2 have one big difference: ggplot2 requires your data to be in tidy format. For base graphics, it can actually be helpful not to have your data in tidy format.
For example, here ggplot treats country as an aesthetic parameter that differentiates groups of values, whereas base graphics treats each (year, country) pair of columns as a set of inputs to the plot.
Here’s ggplot with the data in a tidy format.
head(gap)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country), show.legend = FALSE)
Is that a useful plot?
And here’s use of base graphics, taking advantage of non-tidy, wide-formatted data.
# Base graphics call
gap_wide <- gap %>% select(country, year, lifeExp) %>% spread(country, lifeExp)
gap_wide[1:5, 1:5]
## year Afghanistan Albania Algeria Angola
## 1 1952 28.801 55.23 43.077 30.015
## 2 1957 30.332 59.28 45.685 31.999
## 3 1962 31.997 64.82 48.303 34.000
## 4 1967 34.020 66.22 51.407 35.985
## 5 1972 36.088 67.69 54.518 37.928
plot(gap_wide$year, gap_wide$China, col = 'red', type = 'l', ylim = c(40, 85))
lines(gap_wide$year, gap_wide$Turkey, col = 'green')
lines(gap_wide$year, gap_wide$Italy, col = 'blue')
legend("right", legend = c("China", "Turkey", "Italy"),
fill = c("red", "green", "blue")) # colors listed in the same order as the lines drawn above
Of course, as mentioned above, you can always filter your tidy data to replicate this plot with ggplot2…
gap %>%
filter(country %in% c("China", "Turkey", "Italy")) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(color = country))
ggplot2
ggplot2 geoms
We’ve already seen these initial ones.
X-Y scatter plots: geom_point()
X-Y line plots: geom_line() or geom_path()
Histograms and bar charts: geom_histogram(), geom_col(), or geom_bar()
gap2007 <- gap %>% filter(year == 2007)
ggplot(gap2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
ggtitle("World's life expectancy")
Densities: geom_density(), geom_density2d()
Boxplots: geom_boxplot()
# Notice that here, you must explicitly convert numeric years to factors
ggplot(data = gap, aes(x = factor(year), y = lifeExp)) +
geom_boxplot()
“Trellis” plots: facet_grid() or facet_wrap()
Contour plots: geom_contour()
data(volcano) # Load volcano contour data
volcano[1:10, 1:10] # Examine volcano dataset (first 10 rows and columns)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 100 100 101 101 101 101 101 100 100 100
## [2,] 101 101 102 102 102 102 102 101 101 101
## [3,] 102 102 103 103 103 103 103 102 102 102
## [4,] 103 103 104 104 104 104 104 103 103 103
## [5,] 104 104 105 105 105 105 105 104 104 103
## [6,] 105 105 105 106 106 106 106 105 105 104
## [7,] 105 106 106 107 107 107 107 106 106 105
## [8,] 106 107 107 108 108 108 108 107 107 106
## [9,] 107 108 108 109 109 109 109 108 108 107
## [10,] 108 109 109 110 110 110 110 109 109 108
volcano3d <- melt(volcano) # Use reshape2 package to melt the data into tidy form
head(volcano3d) # Examine volcano3d dataset (head)
## Var1 Var2 value
## 1 1 1 100
## 2 2 1 101
## 3 3 1 102
## 4 4 1 103
## 5 5 1 104
## 6 6 1 105
names(volcano3d) <- c("xvar", "yvar", "zvar") # Rename volcano3d columns
ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) +
geom_contour()
Tile/image/level plots, heatmaps: geom_tile(), geom_rect(), geom_raster()
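Two of the entries in this list, trellis plots and heatmaps, can be sketched as follows. The data frame `le` is an illustrative subset of gapminder life expectancy values, `volcano` is the dataset that ships with R, and the reshaping with `row()` and `col()` is just one base R option, used here to avoid depending on reshape2.

```r
library(ggplot2)

# Illustrative subset of gapminder life expectancy values (four countries)
le <- data.frame(
  country = rep(c("Afghanistan", "Albania", "Algeria", "Angola"), each = 5),
  year    = rep(c(1952, 1957, 1962, 1967, 1972), times = 4),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088,
              55.23, 59.28, 64.82, 66.22, 67.69,
              43.077, 45.685, 48.303, 51.407, 54.518,
              30.015, 31.999, 34.000, 35.985, 37.928)
)

# "Trellis" plot: one panel per country, arranged in two rows
p_facet <- ggplot(le, aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap(~ country, nrow = 2)

# Heatmap: reshape the volcano matrix to one row per grid cell, then fill by height
volcano_long <- data.frame(
  xvar = as.vector(row(volcano)),
  yvar = as.vector(col(volcano)),
  zvar = as.vector(volcano)
)
p_heat <- ggplot(volcano_long, aes(x = xvar, y = yvar, fill = zvar)) +
  geom_raster()

p_facet
p_heat
```

`geom_raster()` assumes a regular grid of equally sized cells, which holds for a matrix like `volcano`; `geom_tile()` is the more general choice when cell sizes vary.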
ggplot2
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10()
# Add linear model (lm) smoother
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "lm")
# Add local regression (loess) smoother, span of 0.75 (more smoothed)
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "loess", span = .75)
# Add local regression (loess) smoother, span of 0.25 (less smoothed)
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "loess", span = .25)
# Add linear model (lm) smoother, no standard error shading
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "lm", se = FALSE)
# Add local regression (loess) smoother, no standard error shading
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "loess", se = FALSE)
aes()
These four aesthetic parameters (color, linetype, shape, size) can be used to show variation in kind (categories) and variation in degree (numeric).
Parameters passed into aes should be variables in your dataset.
Parameters passed to geom_xxx outside of aes should not be related to your dataset; they are fixed values that apply to everything drawn by that layer.
ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country), show.legend = FALSE)
Note what happens when we specify the color parameter outside of the aesthetic operator. ggplot2 views these specifications as invalid graphical parameters.
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomLine, : object 'country' not found
## Error in grDevices::col2rgb(colour, TRUE): invalid color name 'country'
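The two error messages above are consistent with two different mistakes: passing an unquoted column name outside aes(), and passing the column name as a quoted string. The sketch below reproduces both under that assumption, using a small illustrative data frame `le` and tryCatch() so the script keeps running; the exact wording of the messages may differ between ggplot2 versions.

```r
library(ggplot2)

# Illustrative two-country subset of gapminder life expectancy values
le <- data.frame(
  country = rep(c("Afghanistan", "Albania"), each = 5),
  year    = rep(c(1952, 1957, 1962, 1967, 1972), times = 2),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088,
              55.23, 59.28, 64.82, 66.22, 67.69)
)

# Inside aes(): color is mapped to the country column and a legend is drawn
p_ok <- ggplot(le, aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country))

# Outside aes(), unquoted: R looks for an object named `country` and fails
err_unquoted <- tryCatch(
  { geom_line(color = country); "no error" },
  error = function(e) conditionMessage(e)
)

# Outside aes(), quoted: "country" is read as a color name, which fails
# only when the plot is converted to graphical objects for drawing
p_bad <- ggplot(le, aes(x = year, y = lifeExp, group = country)) +
  geom_line(color = "country")
err_quoted <- tryCatch(
  { ggplotGrob(p_bad); "no error" },
  error = function(e) conditionMessage(e)
)

err_unquoted
err_quoted
```

The second failure appears late because ggplot2 does not check color names when the layer is created, only when it is rendered.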
## this works but only makes sense if we restrict to one country
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
geom_line(color = "red")
Note: Aesthetics automatically show up in your legend; parameters (those not mapped to a variable in your data frame) do not!
Differences in kind
## color as the aesthetic to differentiate by continent
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) + scale_x_log10()
## point shape as the aesthetic to differentiate by continent
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(shape = continent)) + scale_x_log10()
## line type as the aesthetic to differentiate by country
gapOceania <- gap %>% filter(continent %in% 'Oceania')
ggplot(data = gapOceania, aes(x = year, y = lifeExp)) +
geom_line(aes(linetype = country)) + scale_x_log10()
Differences in degree
## point size as the aesthetic to differentiate by population
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(size = pop)) + scale_x_log10()
## color as the aesthetic to differentiate by population
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = pop)) + scale_x_log10() +
scale_color_gradient(low = 'lightgray', high = 'black')
Multiple non-coordinate aesthetics (differences in kind using color, degree using point size)
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(size = pop, color = continent)) + scale_x_log10()
Aesthetics are handled by their very own scale functions, which allow you to set the limits, breaks, transformations, and any palettes that determine how your data are plotted. ggplot2 includes a number of helpful default scale functions, like scale_x_log10(), which can transform your data on the fly, or scale_color_viridis_d(), which uses palettes from the viridis package specifically designed to “make plots that are pretty, better represent your data, easier to read by those with colorblindness, and print well in grey scale.”
For example, our data might be better represented using a log10 transformation of per capita GDP:
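A minimal sketch of such a transformation is shown below. The data frame `afg` holds only the six Afghanistan rows of the gapminder data, so it illustrates the mechanics rather than the visual benefit; with the full dataset the same `scale_x_log10()` call would be added to a plot of `gap`.

```r
library(ggplot2)

# Illustrative values only: six Afghanistan rows of the gapminder data
afg <- data.frame(
  gdpPercap = c(779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134),
  lifeExp   = c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438)
)

p_log <- ggplot(afg, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()  # positions use log10(gdpPercap); axis labels stay in original units

p_log

# The values ggplot2 actually positions on the x axis are the log10 of the inputs
layer_data(p_log)$x
```

Because the scale does the transformation, the data frame itself is left unchanged.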
And perhaps we want colors that are a little different:
ggplot(gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) +
scale_x_log10() +
scale_color_viridis_d()
Or perhaps you want to set your palettes and breaks or labels manually:
ggplot(gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) +
scale_x_log10(labels = scales::dollar) +
scale_color_manual("Continent",
values = c("red", "blue", "green", "yellow", "#800080")) # hex codes work!
For more info about setting scales in ggplot2 and for more helper functions, consider diving into the scales package, which is the backend to much of the scales functionality in ggplot2.
ggplot handles many plot options as additional layers.
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
xlab(label = "GDP per capita") +
ylab(label = "Life expectancy") +
ggtitle(label = "Gapminder")
Or, even more simply, use the labs() function:
ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
labs(x = "GDP per capita", y = "Life expectancy", title = "Gapminder")
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(size=3)
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(size = 1)
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(color = colors()[11])
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(color = "red")
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(shape = 3)
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(shape = "w")
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point(shape = "$", size = 5)
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
geom_line(linetype = 1)
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
geom_line(linetype = 2)
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
geom_line(linetype = 5, size = 2)
ggplot2
Elements of the plot not associated with geometries can be adjusted using ggplot themes.
There are some “complete” themes already included with the package:
- theme_gray() (the default)
- theme_minimal()
- theme_bw()
- theme_light()
- theme_dark()
- theme_classic()
In addition to these, you can tweak just about any element of your plot’s appearance using the theme() function.
For instance, perhaps you want to move the legend from the right (its default position) to the bottom of your plot; this would be part of the plot theme. Note how you can add options to a complete theme already in the plot:
gap %>%
filter(country %in% c("China", "Turkey", "Italy")) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(color = country)) +
theme_minimal() +
theme(legend.position = "bottom")
ggplot2 graphs can be combined using the grid.arrange() function in the gridExtra package.
# Initialize gridExtra library
library(gridExtra)
# Create 3 plots to combine in a table
plot1 <- ggplot(data = gap2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point() + scale_x_log10() + annotate('text', 150, 80, label = '(a)')
plot2 <- ggplot(data = gap2007, aes(x = pop, y = lifeExp)) +
geom_point() + scale_x_log10() + annotate('text', 1.8e5, 80, label = '(b)')
plot3 <- ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country), show.legend = FALSE) +
annotate('text', 1951, 80, label = '(c)')
# Call grid.arrange
grid.arrange(plot1, plot2, plot3, nrow = 3, ncol = 1)
patchwork: Combining Multiple ggplot2 plots
The patchwork package may be used to combine multiple ggplot2 plots using a small set of operators similar to the pipe. It is an alternative to gridExtra and allows complex arrangements to be built nearly effortlessly.
# Install and initialize patchwork library
# devtools::install_github("thomasp85/patchwork")
library(patchwork)
# use the patchwork operators
# stack plots horizontally
plot1 + plot2 + plot3
# stack plots vertically
plot1 / plot2 / plot3
# side-by-side plots with third plot below
(plot1 | plot2) / plot3
# side-by-side plots with a space in between, and a third plot below
(plot1 | plot_spacer() | plot2) / plot3
# stack plots vertically and alter with a single "gg_theme"
(plot1 / plot2 / plot3) & theme_bw()
Feel free to explore more at https://github.com/thomasp85/patchwork.
Note: patchwork is an example of a ggplot2 extension package of which there are many! One of the benefits to learning and using ggplot2 is that there is a huge community of developers that build separate graphics packages that generally use the same syntax to extend the ggplot2 functionality into things like animation and 3D plotting! Check them out –> http://www.ggplot2-exts.org/gallery/
Two basic image types:
Raster: every pixel of a plot contains its own separate coding; not so great if you want to resize the image
Vector: every element of a plot is encoded with a function that gives its coding conditional on several factors; great for resizing
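One way to write both kinds of file is ggplot2's ggsave(), which picks the graphics device from the file extension. The sketch below is illustrative only: the data frame, file names, and dimensions are assumptions, files are written to a temporary directory, and the PNG step is skipped where R was built without PNG support.

```r
library(ggplot2)

# Small illustrative data frame (Afghanistan life expectancy, gapminder values)
afg <- data.frame(
  year    = c(1952, 1957, 1962, 1967, 1972, 1977),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438)
)
p <- ggplot(afg, aes(x = year, y = lifeExp)) + geom_line()

out_dir  <- tempdir()
pdf_file <- file.path(out_dir, "lifeExp.pdf")
png_file <- file.path(out_dir, "lifeExp.png")

# Vector output (PDF): elements are stored as shapes, so it stays sharp when resized
ggsave(pdf_file, plot = p, width = 6, height = 4)

# Raster output (PNG): a fixed pixel grid, so the resolution (dpi) is chosen up front
if (capabilities("png")) {
  ggsave(png_file, plot = p, width = 6, height = 4, dpi = 300)
}
```

For reports and papers, the vector file is usually the safer choice because it can be scaled without becoming blurry.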
ggplot
These questions ask you to work with the gapminder dataset.
Plot a histogram of life expectancy.
Plot the gdp per capita against population. Put the x-axis on the log scale.
Clean up your scatterplot with a title and axis labels. Output it as a PDF and see if you’d be comfortable with including it in a report/paper.
Create a trellis plot of life expectancy by gdpPercap scatterplots, one subplot per continent. Use a 2x3 layout of panels in the plot. Now have the size of the points vary with population. Use scale_x_continuous() to set the x-axis limits to be in the range from 100 to 50000.
Make a boxplot of life expectancy conditional on binned values of gdp per capita.
Using the data for 2007, recreate as much as you can of this famous Gapminder plot, where the colors are different continents. (Don’t worry about the ‘2015’ in the background and ignore the ‘play’ button at the bottom.)
Create a “trellis” plot where, for a given year, each panel uses a) hollow circles to plot lifeExp as a function of log(gdpPercap), and b) a red loess smoother without standard errors to plot the trend. Turn off the grey background. Figure out how to use partially-transparent points to reduce the effect of the overplotting of points.